DOCUMENT RESUME

ED 041 937                                                  TM 000 029

AUTHOR          Pinsky, Paul D.
TITLE           A Mathematical Model of Testing for Instructional
                Management Purposes.
SPONS AGENCY    Charles F. Kettering Foundation, Dayton, Ohio.
PUB DATE        Mar 70
NOTE            19p.; From symposium "Designing Instructional
                Systems with Longitudinal Testing Using Item
                Sampling Techniques." (Annual meeting, American
                Educational Research Association, Minneapolis,
                Minn., March 1970)

EDRS PRICE      EDRS Price MF-$0.25 HC-$1.05
DESCRIPTORS     Behavioral Objectives, Instructional Improvement,
                Management, *Mathematical Models, Statistical
                Analysis, Symposia, *Test Construction, *Testing
IDENTIFIERS     *Comprehensive Achievement Monitoring



ABSTRACT 

Developing a student testing mathematical model for 
instructional management purposes necessitates clear structuring of 
the curriculum materials involved, whether designated in the domain 
of content or the dimension of concepts or skills. Such structuring 
of a course written in performance objectives is presented and noted 
to be helpful in making decisions concerning the construction of 
tests over time and in understanding the inter-relationships of the 
parts of the curriculum from the testing results. An algorithm to 
select test items used to estimate desired parameters is developed. 
Inputs into the model are the average time required to answer each 
item, the errors of measurement associated with each item, the 
relative value of the information provided, the prior knowledge of 
this information, and a value function on the accuracy of the 
resultant estimates. Finally, techniques are given to allocate test 
items to students in such a way as to generate simultaneous estimates 
of item and student group characteristics. References are included.

Considering a course written in performance objectives, the paper
explores the information needed by a teacher for instructional management 
purposes. Then, based upon these needs at a fixed point in time, an algorithm 
is developed to select the test items to be used to estimate the desired 
parameters. Inputs into the model are the average time required to answer 
each item, the errors of measurement associated with each item, the relative 
value of the information provided, the prior knowledge of this information, and 
a value function on the accuracy of the resultant estimates. Finally,
techniques are given to allocate test items to students in such a way as to
generate simultaneous estimates of item and student group characteristics.

1.0 Introduction

Consider a teacher who wishes to impart some knowledge to a group of 
students. This teacher has a brother-in-law who is an operations researcher 
for a management information firm. At a family summer outing, the teacher 
overheard his brother-in-law talking about information feedback, control 
theory, decision analysis, and a whole raft of unfamiliar jargon. As the 
teacher was on vacation for the summer, he decided to find out about all these 
powerful techniques and apply them to his teaching in the fall. After all, 
if his brother-in-law could increase efficiency by 100% or more in the business 
and industrial world, why couldn't he do the same in the classroom? Think, 
only 30 minutes a day for each class instead of the usual 60! After a few 
years of practice, he could write a book, form his own company, and retire 
with the knowledge that he had done his share to save mankind. 

In order to develop a model of testing students for instructional 
management purposes, it is necessary to explore the types of information 
that would be useful to a teacher in a classroom environment. First of all, 
the structure of the material the instructor is attempting to teach must be 
clearly defined. This material will be viewed in the content domain in a 
structure as shown in figure 1 for the model presented herein. Most courses 
currently defined in performance objectives do not recognize the concept of 
measurable objectives (MO's). However, in a course containing 100 or more 
performance objectives, it is not feasible to obtain statistically reliable 
estimates of achievement levels in all or most of the objectives at the same 
time. This phenomenon (the bandwidth-fidelity dilemma) is explained in more
detail in Cronbach [3], Chapter 8. Therefore the course in figure 1 has been
organized into groups of performance objectives large enough to yield
reliable statistical estimates, but small enough to yield information valuable
for instructional management purposes. CAM is presently using the performance
objectives covered in a week's time as the set of measurable objectives. As
will become clear later on, this concept of measurable objectives does not
cause any loss of information concerning the individual performance objectives.
This is important to note because data relating to performance objectives,
although not statistically reliable, can still be quite meaningful to the
teacher and student.

For notational purposes, the units will be indexed by a,
$\{U_a,\ a = 1,\dots,A\}$, and measurable objectives by a (the unit they are
in) and b, $\{MO_{ab},\ a = 1,\dots,A,\ b = 1,\dots,B_a\}$. Note that there
can be different numbers of MO's per unit. Viewing the course in the structure
presented above is helpful for making decisions concerning the construction of
tests over time and in understanding the inter-relationships of the parts of
the curriculum from the results obtained by testing. Furthermore, it is assumed
that the teacher has a presentation strategy; that is, a plan of the order in
which the material will be presented to the class.



One can also view the curriculum in a second dimension, that of concepts
or skills. An example of this dimension is the taxonomy developed by Bloom [1].
However, because of the difficulty in applying such classification schemes to
curriculum and the increase in mathematical complexity in the present
formulation, only the single dimensional classification by content will be
considered here.



In order to discuss the possible goals of teaching a curriculum, some
parameters must be defined. Let $P_{abs}$ be the achievement level of student s
in $MO_{ab}$. A convenient definition of $P_{abs}$ is the expected percentage
of correct responses by student s to the universe of items that measures
achievement in $MO_{ab}$. The estimate of $P_{abs}$ can then be obtained by
sampling items from the universal pool.

$$P_{ab\cdot} = \frac{1}{S} \sum_{s=1}^{S} P_{abs}$$

is the average achievement level of the whole class in $MO_{ab}$. Now consider
the goal of the teacher as maximizing
$F(\{P_{abs},\ a = 1,\dots,A,\ b = 1,\dots,B_a,\ s = 1,\dots,S\})$. The
following are specific examples of $F(\cdot)$ that teachers may choose:

$$F(\cdot) = \sum_{a=1}^{A} \sum_{b=1}^{B_a} P_{ab\cdot}$$

This function values learning in all MO's and by all students equally. 



$$F(\cdot) = \sum_{a=1}^{A} \sum_{b=1}^{B_a} \alpha_{ab} P_{ab\cdot}$$

This function values learning by all students equally, but values
learning in $MO_{ab}$ according to the weight $\alpha_{ab}$.



$$F(\cdot) = \sum_{a=1}^{A} \max_{b = 1,\dots,B_a} P_{ab\cdot}$$



This function values learning one and only one MO within each unit 
very well. 



$$F(\cdot) = \sum_{a=1}^{A} \min_{b = 1,\dots,B_a} P_{ab\cdot}$$



This function values learning at least something about every MO in 
each unit. 



The above are but a few examples of possible $F(\cdot)$. Others could
differentiate between students, be non-linear functions of $P_{abs}$, or
compare $P_{ab\cdot}$ at the end with $P_{ab\cdot}$ at the beginning of the
course.
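The four example objective functions can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the nested-list layout
`course[a][b][s]` for $P_{abs}$ and all function names are hypothetical and do
not come from the paper.

```python
def class_averages(course):
    """Return P_ab. for every MO, where course[a][b] lists P_abs over students."""
    return [[sum(mo) / len(mo) for mo in unit] for unit in course]


def f_equal(course):
    """Sum of P_ab. over all units and MO's (all MO's valued equally)."""
    return sum(sum(unit) for unit in class_averages(course))


def f_weighted(course, weights):
    """Sum of alpha_ab * P_ab.; weights[a][b] plays the role of alpha_ab."""
    averages = class_averages(course)
    return sum(
        w * p
        for unit_w, unit_p in zip(weights, averages)
        for w, p in zip(unit_w, unit_p)
    )


def f_best_mo(course):
    """Sum over units of the best-learned MO in each unit."""
    return sum(max(unit) for unit in class_averages(course))


def f_worst_mo(course):
    """Sum over units of the worst-learned MO in each unit."""
    return sum(min(unit) for unit in class_averages(course))


# Hypothetical data: unit 1 has two MO's, unit 2 has one; two students.
course = [[[0.5, 0.7], [0.2, 0.4]], [[0.9, 0.9]]]
print(f_equal(course), f_best_mo(course), f_worst_mo(course))
```

Note that the units may hold different numbers of MO's, which the nested lists
accommodate without padding.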




This digression into objective functions is presented to emphasize 
that a teacher should have a specific idea of his goals in order to make 
intelligent decisions regarding instructional strategies. In the next section, 
an algorithm is presented for the selection of test-items to be given to 
students at a fixed point in time. This algorithm requires the teacher to 
input the value of certain information to him. Without a clearly defined set 
of goals, the teacher may have a difficult time selecting the appropriate
values.



The reason that one needs an algorithm to select items for testing
purposes is that there exists only a finite amount of time for such testing.
In this paper, testing is considered as being done at fixed intervals (called
periods, say once a week) for a fixed amount of time each period (T, say 30
minutes). Varying the frequency and length of testing during the school year
is not discussed, but the present model does not exclude such possibilities.

Finally consider the types of information that a teacher might like to
have for each $MO_{ab}$. The value of $P_{abs}$, $s = 1,\dots,S$ before
instruction on $MO_{ab}$ would certainly be useful. So would the values
immediately after instruction and several months after instruction (the
retention level). One of the problems is that $P_{abs}$ is very difficult to
estimate in a statistically reliable fashion due to sampling and measurement
errors. $P_{ab\cdot}$ is easier to measure and is the quantity that will be
estimated. $P_{ab\cdot}$ as defined contains information about all the students
in the course. Section 4.0 explains how estimates of
$\sum_{s \in U_g} P_{abs}$, $g = 1,\dots,G$ can be simultaneously generated
where $U_g$ is a subset of all students. Examples of $U_g$ are various sections
of the course which were exposed to different stimuli, or different achievement
level groups.

2.0 Item Selection Algorithm



As seen in the previous section, a teacher could use information about
$P_{ab\cdot}$, $a = 1,\dots,A$, $b = 1,\dots,B_a$ during each period. However,
good estimates of $P_{ab\cdot}$ require many test items, and there are only T
units of testing time for each student during a period. Therefore, it is
necessary to efficiently select the items to be used on the tests for each
period. The algorithm to be developed will generate $n_{ab}$, the number of
items to be used from $MO_{ab}$. The actual selection of items is then done on
a stratified random basis, the stratification being done on the performance
objectives within the MO. Thus, if $n_{ab} = 8$ and $MO_{ab}$ contains 4
performance objectives, then 2 items will be selected at random from each
performance objective within $MO_{ab}$.
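The stratified random selection just described might be sketched as follows.
The function name, the dictionary layout of the item pool, and the rule for
handing out leftover items when $n_{ab}$ is not a multiple of the number of
performance objectives are assumptions made for illustration; the paper does
not specify them.

```python
import random


def select_items(pools, n_items, rng=None):
    """Draw n_items from one MO, stratified by performance objective.

    pools maps each performance objective to its list of available item ids.
    """
    rng = rng or random.Random()
    per_objective, remainder = divmod(n_items, len(pools))
    # Assumption: leftover items go to randomly chosen objectives, one each.
    extra = set(rng.sample(sorted(pools), remainder))
    chosen = []
    for objective, items in pools.items():
        k = per_objective + (1 if objective in extra else 0)
        chosen.extend(rng.sample(items, k))  # sampling without replacement
    return chosen


# Hypothetical MO with 4 performance objectives and 5 items per objective.
pools = {f"PO{i}": [f"PO{i}-item{j}" for j in range(5)] for i in range(1, 5)}
print(select_items(pools, 8, random.Random(0)))  # 2 items per objective
```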

Assume that there are S students in the course. Enough items can be
selected to consume T units of time, and the same items given to all the
students, or enough items can be selected to consume $S \cdot T$ units of time,
and a different set of items given to each student. In general, one can select
enough items to consume $L \cdot T$ units of time, $1 \le L \le S$, and give
each set of items to S/L students. The following thoughts should be considered
in the selection of L. When L = 1, students can more easily be compared
(all respond to the same items) than when L = S; while when L = 1, estimates
of $P_{ab\cdot}$ will contain more variance (due to sampling errors) than when
L = S. If L = S, then a huge number of items is needed to consume $S \cdot T$
units of time, and the cost of producing the actual test forms can be enormous.
Moreover, if each student takes a different set of items, then the item and
student characteristics will be confounded. Section 4.0 examines the problem of
obtaining simultaneous information about student and item characteristics. In
present projects undertaken by the CAM staff with S = 100 to 150, L has ranged
from 8 to 10. This value of L enables one to do considerable sampling within
each MO, while still retaining much of the power for comparing groups of
students.

What criteria can be used to select items from the various MO's?
The present algorithm uses the following five.

a) The average time required to answer an item pertaining to $MO_{ab}$.
Denote this by $t_{ab}$.

b) The value of the information ($K_{ab}$) on $MO_{ab}$ during the current
period. For instance, if $MO_{ab}$ is not being taught for several months, then
$K_{ab}$ will be low, while if $MO_{ab}$ was taught last week, then $K_{ab}$
will be high. $\{K_{ab},\ a = 1,\dots,A,\ b = 1,\dots,B_a\}$ is an input into
the model and presently is viewed as being subjectively derived. Further
theoretical developments may enable $K_{ab}$ to be determined objectively.

c) The prior knowledge of $P_{ab\cdot}$. If a teacher is quite confident that
$P_{ab\cdot}$ is very low, then there may be no need to use any items from
$MO_{ab}$. This information is input into the model as a binomial prior
$(n', r')$, where the prior expected value of $P_{ab\cdot}$ is $r'/n'$ and the
variance is $(r'/n')(1 - r'/n')/(n' + 1)$. An excellent discussion of binomial
priors is given in Introduction to Statistical Decision Theory (ISDT) [7].

d) The errors of sampling and measurement of items relating to $MO_{ab}$.
See section 3.0.

e) A value function indicating the value of how close the estimate
$\hat{P}_{ab\cdot}$ is to the true value $P_{ab\cdot}$. Denote this function by
$V_{ab}(\hat{P}_{ab\cdot}, P_{ab\cdot})$. Note that $V_{ab}(\cdot,\cdot)$
depends upon the $MO_{ab}$. The value function to be used here will be of the
squared-error loss type. This is not the only meaningful value function to use,
but it appears to be the easiest from a computational viewpoint. The specific
form of the function is (see figure 2),



This value function is properly used when one is interested in an estimate
of $P_{ab\cdot}$ and does not have to choose one of a finite set of actions
based upon the value $P_{ab\cdot}$.

In ISDT it is shown for the case of pure binomial sampling (i.e.,
ignoring measurement errors) that the prior value of $n_{ab}$ items from
$MO_{ab}$ is

$$I(n_{ab}) = K_{ab}\, v'_{P_{ab\cdot}}\, \frac{n_{ab}}{n_{ab} + n'_{ab}},
\qquad n_{ab} = 0, 1, 2, \dots$$

where

$$v'_{P_{ab\cdot}} = \frac{r'_{ab}}{n'_{ab}}
\left(1 - \frac{r'_{ab}}{n'_{ab}}\right) \frac{1}{n'_{ab} + 1}$$

is the variance of the prior estimate of $P_{ab\cdot}$. Here the prior
information is contained in the pair $(n'_{ab}, r'_{ab})$. It can be shown that
$I(n_{ab})$ is strictly concave in $n_{ab}$ (see figure 3). The next step is to
take into account the errors of measurement inherent to items measuring
$MO_{ab}$. It is shown in section 3.0 that the relationship of the number of
items from one MO that gives equivalent measurement errors with a fixed number
of items from another MO is linear. Fixing one MO as a standard, one can
generate the ratio of items ($\gamma_{ab}$) necessary to obtain equivalent
information. This yields $J_{ab}(n_{ab}) \equiv I(\gamma_{ab} n_{ab})$ as the
relative value of $n_{ab}$ items from $MO_{ab}$. Finally, the problem to be
solved to yield the desired values $n_{ab}$ is

$$\max \sum_{a=1}^{A} \sum_{b=1}^{B_a} J_{ab}(n_{ab})$$

such that

$$\sum_{a=1}^{A} \sum_{b=1}^{B_a} t_{ab}\, n_{ab} \le L \cdot T,
\qquad n_{ab} \text{ integral.}$$

Since $J_{ab}(n_{ab})$ is strictly concave in $n_{ab}$, this problem may easily
be solved on the computer even for a very large number of MO's.

Figure 2

Figure 3
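One way to attack this problem on a computer is a greedy marginal-value rule:
repeatedly add the single item whose gain in J per unit of testing time is
largest, until no further item fits in the budget L·T. The sketch below is an
illustration only and is not taken from the paper. Because the times $t_{ab}$
may differ between MO's, the greedy rule is a heuristic rather than a
guaranteed optimum. The dictionary keys (`t`, `K`, `n_prior`, `r_prior`,
`gamma`) and function names are hypothetical, and `n_prior` is assumed to be
positive.

```python
import heapq


def prior_variance(n_prior, r_prior):
    """Variance of the binomial prior (n', r') as given in the text."""
    mean = r_prior / n_prior
    return mean * (1 - mean) / (n_prior + 1)


def value_of_items(n, mo):
    """J_ab(n) = I(gamma * n), with I(n) = K * v' * n / (n + n')."""
    effective = mo["gamma"] * n
    v = prior_variance(mo["n_prior"], mo["r_prior"])
    return mo["K"] * v * effective / (effective + mo["n_prior"])


def allocate_items(mos, budget):
    """Greedy heuristic: returns {MO name: number of items n_ab}."""
    counts = {name: 0 for name in mos}
    used = 0.0

    def rate(name):
        # Gain in value from one more item, per unit of testing time.
        mo, n = mos[name], counts[name]
        gain = value_of_items(n + 1, mo) - value_of_items(n, mo)
        return gain / mo["t"]

    heap = [(-rate(name), name) for name in mos]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        if used + mos[name]["t"] > budget:
            continue  # no further item from this MO fits; drop it
        counts[name] += 1
        used += mos[name]["t"]
        heapq.heappush(heap, (-rate(name), name))
    return counts


# Hypothetical inputs: one MO taught last week (high K), one far in the future.
mos = {
    "MO_1_1": {"t": 1.0, "K": 1.0, "n_prior": 4, "r_prior": 2, "gamma": 1.0},
    "MO_1_2": {"t": 1.0, "K": 0.1, "n_prior": 4, "r_prior": 2, "gamma": 1.0},
}
print(allocate_items(mos, budget=10.0))
```

The strict concavity of $J_{ab}$ is what makes the marginal gains decrease, so
each MO needs only its next marginal gain on the heap at any time.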
3.0 Errors of Measurement

In attempting to measure $P_{ab\cdot}$, there will be two sources of errors
that can be of different magnitude in different MO's. These are the format
in which the items are written (free-response, multiple choice, or true-false)
and the generic reliability of the test items. The following discussion will
demonstrate the linear relationship between the number of items in each of two
distinct MO's necessary to generate estimates with equivalent errors of
measurement.



The question of the error variance introduced by the use of multiple
choice versus free-response items is not an easy one to answer. There is the
question as to whether a free-response format of a multiple choice item is the
same question. Furthermore, what is the true guessing factor in a multiple
choice item? If there are m alternatives to an item and the item is
"perfectly" written, then if the student does not know the answer, he will
respond correctly with probability 1/m. However, how many teacher-constructed
items are perfectly written in the sense described above? For the present
analysis, assume that Sam and Joe are identical students (in the sense that they
both know exactly the same material) and will respond to an n item test.
However, Joe will answer in free-response format while Sam will be able to
guess with probability 1/m if he does not know the correct response. The
items given are drawn at random without replacement from the pool of all items
that measure achievement in a given MO. Let X and Y be the number of
correct responses given by Joe and Sam respectively, P the true percentage of
the total item pool that Sam and Joe know, $\hat{P}_j$ the estimate of P from
Joe's score and $\hat{P}_s$ the estimate of P from Sam's score. Then

$$\hat{P}_j = \frac{X}{n}, \qquad
\sigma^2(\hat{P}_j) = \frac{1}{n} P(1-P).$$

Now

$$\hat{P}_s = \frac{\frac{Y}{n} - \frac{1}{m}}{1 - \frac{1}{m}}$$

as may be seen by writing

$$E\left(\frac{Y}{n}\right) = P + (1-P)\frac{1}{m}$$

and substituting $Y/n$ for $E(Y/n)$ and $\hat{P}_s$ for P, and

$$\sigma^2(\hat{P}_s)
= \sigma^2\left(\frac{\frac{Y}{n} - \frac{1}{m}}{1 - \frac{1}{m}}\right)
= \frac{\sigma^2\left(\frac{Y}{n}\right)}{\left(1 - \frac{1}{m}\right)^2}.$$

Since Sam's probability of a correct response for an item drawn at random is

$$P' = P + (1-P)\frac{1}{m},$$

$$\sigma^2\left(\frac{Y}{n}\right)
= \frac{1}{n}\left[P + (1-P)\frac{1}{m}\right]
\left[1 - P - (1-P)\frac{1}{m}\right], \qquad
\sigma^2(\hat{P}_s) = \frac{1}{n} P(1-P)
+ \frac{1}{n}\,\frac{\frac{1}{m}}{1 - \frac{1}{m}}\,(1-P).$$

Thus,

$$\sigma^2(\hat{P}_s) - \sigma^2(\hat{P}_j)
= \frac{1}{n}\,\frac{\frac{1}{m}}{1 - \frac{1}{m}}\,(1-P).$$

Now one can ask how many test items does Sam have to respond to ($n_s$) in
order to reduce $\sigma^2(\hat{P}_s)$ to $\sigma^2(\hat{P}_j)$ when Joe responds
to $n_j$ items.

$$\sigma^2(\hat{P}_j) = \frac{1}{n_j} P(1-P)
= \frac{1}{n_s} P(1-P)
+ \frac{1}{n_s}\,\frac{\frac{1}{m}}{1 - \frac{1}{m}}\,(1-P)
= \sigma^2(\hat{P}_s)$$

or

$$n_s = n_j \left[1 + \frac{\frac{1}{m}}{\left(1 - \frac{1}{m}\right) P}\right].$$

The above is the linear relationship necessary to correct for format
differences. Note that this depends not only upon the number of alternative
responses (m), but also on the true achievement level (P) of the students.
The number of additional items necessary to produce sampling errors equivalent
to free-response format when multiple choice format with m alternatives is
used is $\frac{1/m}{(1 - 1/m)\,P}\, n_j$. Note that as P approaches zero, this
goes to infinity. So multiple choice and true-false items are especially
inefficient for estimation purposes when used as pre-instruction measuring
devices.
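As a numerical illustration of the format correction, the following sketch
computes how many m-alternative items Sam would need to match Joe's n_j
free-response items. The function name and argument names are hypothetical;
the formula is the one derived above.

```python
def equivalent_multiple_choice_items(n_free, m, p):
    """Items with m alternatives needed to match n_free free-response items.

    p is the true achievement level P (0 < p <= 1); m >= 2 alternatives.
    Implements n_s = n_j * [1 + (1/m) / ((1 - 1/m) * P)].
    """
    if m < 2:
        raise ValueError("an item needs at least two alternatives")
    if not 0 < p <= 1:
        raise ValueError("P must lie in (0, 1]; the ratio diverges as P -> 0")
    guessing = (1 / m) / (1 - 1 / m)
    return n_free * (1 + guessing / p)


# Hypothetical case: 10 free-response items, achievement level P = 0.5.
for m in (2, 4, 5):
    print(m, equivalent_multiple_choice_items(10, m, 0.5))
```

With P = 0.5, true-false items (m = 2) require three times as many items as
free-response items, and the multiplier grows without bound as P falls toward
zero, which is the pre-instruction situation.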

The above analysis yields a quite different conclusion from that of
Lord [5]. In his work, he assumed that Sam and Joe both had the same expected
score on their tests, i.e., E(X) = E(Y), and came to the correct conclusion
that the standard errors of their scores were equal. However, under his
assumption, Sam and Joe would be different students with different knowledge
levels, and therefore not comparable in the fashion Lord indicated.

There is no reason to believe that the generic reliability (or the
generic error of measurement) will be the same from one MO to another. A
discussion of these errors of measurement and methods of estimating them are
presented in Lord and Novick [6] (Chapters 8 and 9), Cronbach [2], and
Rajaratnam [8]. Once the generic reliabilities have been estimated, the
Spearman-Brown prophecy formula can be used to derive the linear relationship
necessary in the item selection algorithm to correct for measurement error
between MO's. If $MO_s$ and $MO_r$ have generic reliabilities $\rho_s$ and
$\rho_r$ respectively for the same number of items n, then the number of items
($n_s$) from $MO_s$ needed to make $\rho_s$ equal to $\rho_r$ is given by

$$n_s = \frac{\rho_r (1 - \rho_s)}{\rho_s (1 - \rho_r)}\, n$$
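The relationship can be checked numerically against the Spearman-Brown
prophecy formula itself. In this sketch the function names are hypothetical;
`spearman_brown` is the standard formula for the reliability of a test
lengthened by a factor k, and `items_needed` is the expression given above.

```python
def spearman_brown(rho, k):
    """Reliability of a test lengthened by a factor k (Spearman-Brown)."""
    return k * rho / (1 + (k - 1) * rho)


def items_needed(n, rho_s, rho_r):
    """Items from MO_s needed so its reliability rises from rho_s to rho_r."""
    return n * rho_r * (1 - rho_s) / (rho_s * (1 - rho_r))


# Hypothetical reliabilities for two MO's measured with n = 10 items each.
n, rho_s, rho_r = 10, 0.5, 0.8
n_s = items_needed(n, rho_s, rho_r)
print(n_s, spearman_brown(rho_s, n_s / n))
```

Lengthening the less reliable MO by the factor n_s / n brings its predicted
reliability up to that of the other MO, which is the linear correction used in
the item selection algorithm.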
4.0 Administration of Tests to Students

Once the $n_{ab}$, $a = 1,\dots,A$, $b = 1,\dots,B_a$ items have been selected
from $MO_{ab}$ as described in Section 2.0, the L "approximately" parallel forms
must be constructed. This can be done by arranging the items on a one
dimensional scale according to the chronological presentation to the class of
the performance objective to which the items are related, and assigning every
L-th item to the same form. Rigorously speaking, this will result in a violation
of the time constraint for some of the forms. However, the variance among
students' solution time for each item will be quite large, and unless most of
the time consuming items are assigned to one form, no gross violations should
result. Moreover, a few appropriate switches of items should correct any of
these gross violations. Once the L forms have been constructed, each student
in the course must be assigned a form to take. This assignment task should be
done rigorously as shown below in order to get the most possible information
out of the test data.
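The "every L-th item" rule for building the forms can be sketched as follows.
The representation of an item as an (id, expected minutes) pair and the
function names are assumptions for illustration; the items are taken to be
already sorted in chronological order of their performance objectives.

```python
def build_forms(items_in_order, n_forms):
    """Assign every n_forms-th item to the same form.

    items_in_order: list of (item_id, expected_minutes), chronologically sorted.
    Returns a list of n_forms lists of items.
    """
    return [items_in_order[k::n_forms] for k in range(n_forms)]


def form_time(form):
    """Total expected answering time of one form."""
    return sum(minutes for _, minutes in form)


# Hypothetical items with average answering times t_ab in minutes.
items = [("i1", 2), ("i2", 3), ("i3", 1), ("i4", 4), ("i5", 2)]
forms = build_forms(items, 2)
print([form_time(form) for form in forms])
```

Comparing the form times against T shows which forms violate the time
constraint, and therefore where a few item switches would be appropriate.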

Consider two partitions of the students in the course. A partition is
a collection of subsets of all the students such that each student is in one
and only one subset. The two partitions that probably will be of interest are
ones based upon achievement levels and physical attributes (different sections,
teachers, etc.) of the class. For notational purposes, let $U_i$,
$i = 1,\dots,I$ and $V_j$, $j = 1,\dots,J$ be the two partitions. Next, consider
the cartesian product of the two partitions as shown below. Let
$W_{ij} = U_i \cap V_j$ be those students in both $U_i$ and $V_j$, while
$m_{ij} \ge 0$ is the number of students in subset $W_{ij}$. Finally, let
$a_{ij\lambda}$, $\lambda = 1,\dots,L$ be the number of students in $W_{ij}$
who will respond to form $\lambda$. It is the $a_{ij\lambda}$ that must be
calculated. The requirements that all L forms are distributed evenly among
partition V are

$$(4.1)\qquad \left|\sum_{i=1}^{I} a_{ij\lambda}
- \sum_{i=1}^{I} a_{ij\lambda'}\right| \le 1
\qquad \forall\ \lambda, \lambda' = 1,\dots,L;\ j = 1,\dots,J$$

while that of partition U are

$$(4.2)\qquad \left|\sum_{j=1}^{J} a_{ij\lambda}
- \sum_{j=1}^{J} a_{ij\lambda'}\right| \le 1
\qquad \forall\ \lambda, \lambda' = 1,\dots,L;\ i = 1,\dots,I$$



Once the $a_{ij\lambda}$ are found that satisfy the above constraints, then one
randomly selects $a_{ij1}$ students from $W_{ij}$ and gives them form 1,
randomly selects another $a_{ij2}$ and gives them form 2, etc. Present research
is directed toward developing a computer algorithm to find a solution
$a_{ij\lambda}$, $i = 1,\dots,I$, $j = 1,\dots,J$, $\lambda = 1,\dots,L$.
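The paper does not give an algorithm for finding the allocation, but checking
whether a candidate allocation meets requirements (4.1) and (4.2) is
straightforward. The sketch below is such a checker only; the nested-list
layout `a[i][j][form]` and the function name are hypothetical.

```python
def forms_evenly_distributed(a):
    """Check (4.1) and (4.2) for a[i][j][form] = students in W_ij on that form."""
    n_u, n_v, n_forms = len(a), len(a[0]), len(a[0][0])
    # (4.1): within each V_j, form totals summed over i differ by at most 1.
    for j in range(n_v):
        totals = [sum(a[i][j][f] for i in range(n_u)) for f in range(n_forms)]
        if max(totals) - min(totals) > 1:
            return False
    # (4.2): within each U_i, form totals summed over j differ by at most 1.
    for i in range(n_u):
        totals = [sum(a[i][j][f] for j in range(n_v)) for f in range(n_forms)]
        if max(totals) - min(totals) > 1:
            return False
    return True


# Hypothetical case: 2 x 2 cells, 2 students per cell, L = 2 forms.
balanced = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]
lopsided = [[[2, 0], [2, 0]], [[0, 2], [0, 2]]]
print(forms_evenly_distributed(balanced), forms_evenly_distributed(lopsided))
```

The second allocation satisfies (4.1) for each $V_j$ but fails (4.2), because
every student in $U_1$ receives form 1. Testing the largest against the
smallest total is equivalent to testing every pair of forms.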



The above procedure of assigning forms to students is useful for two
main reasons. Firstly, if the forms are viewed as fixed and it is desired to
estimate characteristics of the forms within the population of the class, a
doubly stratified sample of students of approximate size S/L has been drawn
at random from the class for each form. Thus, for a fixed item on a form, an
unbiased estimate of its properties in the whole class is obtained, and this
estimate has lower variance than a simple random sampling estimate due to the
double stratification. The items are no longer confounded with any student
or groups of students as defined by the partitions U and V. On the other
hand, if one views the student groups as fixed and wishes to estimate the
groups' performance on a given MO, a stratified (by performance objective,
Section 2.0) sample of items has been drawn within the MO. Furthermore,
essentially the same items have been given to all groups within the same
partition, which enables one to make stronger statistical statements about
differences between the groups. If different groups receive different sets of
items, then an additional variance component is introduced into the comparison
data, the variance being due to item differences. The actual statistical
problems are slightly more complex than represented here due to the fact that
within each subgroup ($U_i$ or $V_j$) different students respond to different
sets of items.